Data Mining CS – 536 Group # 27

Submitted to: Dr. Asim Karim

Group Members:

  • Fazila Sadia 19140006
  • Umar Farooq 19030028
  • Masooma Bukhari 18030025
  • Syed Ali Umair Tirmizi 18030004
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    precision_score,
    recall_score,
    roc_auc_score,
    confusion_matrix,
)

A. Library Usage Note

We are using missinginfo library to visualize missing rows in data.

In [3]:
import sys
!{sys.executable} -m pip install missingno
!{sys.executable} -m pip install sklearn_extra
import missingno as msno 
Requirement already satisfied: missingno in /opt/anaconda3/lib/python3.7/site-packages (0.4.2)
Requirement already satisfied: numpy in /opt/anaconda3/lib/python3.7/site-packages (from missingno) (1.17.2)
Requirement already satisfied: matplotlib in /opt/anaconda3/lib/python3.7/site-packages (from missingno) (3.1.1)
Requirement already satisfied: scipy in /opt/anaconda3/lib/python3.7/site-packages (from missingno) (1.3.1)
Requirement already satisfied: seaborn in /opt/anaconda3/lib/python3.7/site-packages (from missingno) (0.9.0)
Requirement already satisfied: cycler>=0.10 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->missingno) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->missingno) (1.1.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->missingno) (2.4.2)
Requirement already satisfied: python-dateutil>=2.1 in /opt/anaconda3/lib/python3.7/site-packages (from matplotlib->missingno) (2.8.0)
Requirement already satisfied: pandas>=0.15.2 in /opt/anaconda3/lib/python3.7/site-packages (from seaborn->missingno) (0.25.1)
Requirement already satisfied: six in /opt/anaconda3/lib/python3.7/site-packages (from cycler>=0.10->matplotlib->missingno) (1.12.0)
Requirement already satisfied: setuptools in /opt/anaconda3/lib/python3.7/site-packages (from kiwisolver>=1.0.1->matplotlib->missingno) (41.4.0)
Requirement already satisfied: pytz>=2017.2 in /opt/anaconda3/lib/python3.7/site-packages (from pandas>=0.15.2->seaborn->missingno) (2019.3)
Collecting sklearn_extra
  ERROR: Could not find a version that satisfies the requirement sklearn_extra (from versions: none)
ERROR: No matching distribution found for sklearn_extra

Deliverable 1

1. Data Load and Cleaning Thresholds

  • _row_removethreshold(%): Rows with Number of missing values(%) above this threshold will be removed. Initially we set this threshold to 70.
  • _dropthreshold(%): Columns having values more than this threshold will be preserved, other columns would be dropped. Initially we set this threshold to 60.
  • _chunksize: We are using a library to draw graph of missing values, this library can draw only 50 columns. so we devided whole columns to chunks of 50

The values in these thresholds are set wisely after so much experiments and cacluations. We'll justify these thresholds in detail in report.

In [4]:
survey_data=pd.read_csv("./data/data.csv",low_memory=False)
data_dictionaary = pd.read_excel('./data/Survey Data Dictionary.xlsx')

row_remove_threshold = 70 #we'll try differnnt threholds to see where we get better results for gender predction
drop_threshold = 60 #only columns with 60% Non-Empty values will be considered
chunk_size =  50 #for drawing graph of missing values

2. Data Visualization and Cleaning [Row wise]

Most of the people don't fill survey data willingly or skip the input fields, so it's great idea to keep only quality survey data. We have reasonable number of rows 20K+. We are cleaning data based on rows first then we deal with columns.

In [5]:
#rows_nulls_percentage: contains the percentage of missing values in every row 
rows_nulls_percentage = list((survey_data.isnull().sum(axis=1)/ len(survey_data.columns))*100)

#converting float to into int, so it would be easy to visualize the data
rows_nulls_percentage = [int(i) for i in rows_nulls_percentage]
y_axis = np.zeros(100)


#calculating rows missing percentages.For example row r1, r2, r3 has 1% missing values, then we are storing 3 against 1 percent
#so to say, X number of rows has x% of missing values
for perc in rows_nulls_percentage:
    y_axis[perc] = y_axis[perc] + 1
    


for percent in range(1, 100):
    print(y_axis[percent], "\tpeople skiped ",str(percent),"\b% (",str(((percent)*len(survey_data.columns))/100),") columns")
0.0 	people skiped  1 % ( 12.35 ) columns
0.0 	people skiped  2 % ( 24.7 ) columns
0.0 	people skiped  3 % ( 37.05 ) columns
0.0 	people skiped  4 % ( 49.4 ) columns
0.0 	people skiped  5 % ( 61.75 ) columns
0.0 	people skiped  6 % ( 74.1 ) columns
0.0 	people skiped  7 % ( 86.45 ) columns
0.0 	people skiped  8 % ( 98.8 ) columns
0.0 	people skiped  9 % ( 111.15 ) columns
0.0 	people skiped  10 % ( 123.5 ) columns
0.0 	people skiped  11 % ( 135.85 ) columns
0.0 	people skiped  12 % ( 148.2 ) columns
0.0 	people skiped  13 % ( 160.55 ) columns
0.0 	people skiped  14 % ( 172.9 ) columns
0.0 	people skiped  15 % ( 185.25 ) columns
0.0 	people skiped  16 % ( 197.6 ) columns
0.0 	people skiped  17 % ( 209.95 ) columns
0.0 	people skiped  18 % ( 222.3 ) columns
0.0 	people skiped  19 % ( 234.65 ) columns
0.0 	people skiped  20 % ( 247.0 ) columns
0.0 	people skiped  21 % ( 259.35 ) columns
0.0 	people skiped  22 % ( 271.7 ) columns
0.0 	people skiped  23 % ( 284.05 ) columns
0.0 	people skiped  24 % ( 296.4 ) columns
0.0 	people skiped  25 % ( 308.75 ) columns
0.0 	people skiped  26 % ( 321.1 ) columns
0.0 	people skiped  27 % ( 333.45 ) columns
0.0 	people skiped  28 % ( 345.8 ) columns
0.0 	people skiped  29 % ( 358.15 ) columns
0.0 	people skiped  30 % ( 370.5 ) columns
0.0 	people skiped  31 % ( 382.85 ) columns
0.0 	people skiped  32 % ( 395.2 ) columns
0.0 	people skiped  33 % ( 407.55 ) columns
0.0 	people skiped  34 % ( 419.9 ) columns
0.0 	people skiped  35 % ( 432.25 ) columns
0.0 	people skiped  36 % ( 444.6 ) columns
0.0 	people skiped  37 % ( 456.95 ) columns
0.0 	people skiped  38 % ( 469.3 ) columns
0.0 	people skiped  39 % ( 481.65 ) columns
0.0 	people skiped  40 % ( 494.0 ) columns
0.0 	people skiped  41 % ( 506.35 ) columns
0.0 	people skiped  42 % ( 518.7 ) columns
1.0 	people skiped  43 % ( 531.05 ) columns
0.0 	people skiped  44 % ( 543.4 ) columns
0.0 	people skiped  45 % ( 555.75 ) columns
1.0 	people skiped  46 % ( 568.1 ) columns
1.0 	people skiped  47 % ( 580.45 ) columns
2.0 	people skiped  48 % ( 592.8 ) columns
2.0 	people skiped  49 % ( 605.15 ) columns
8.0 	people skiped  50 % ( 617.5 ) columns
4.0 	people skiped  51 % ( 629.85 ) columns
13.0 	people skiped  52 % ( 642.2 ) columns
17.0 	people skiped  53 % ( 654.55 ) columns
22.0 	people skiped  54 % ( 666.9 ) columns
23.0 	people skiped  55 % ( 679.25 ) columns
16.0 	people skiped  56 % ( 691.6 ) columns
30.0 	people skiped  57 % ( 703.95 ) columns
49.0 	people skiped  58 % ( 716.3 ) columns
58.0 	people skiped  59 % ( 728.65 ) columns
126.0 	people skiped  60 % ( 741.0 ) columns
204.0 	people skiped  61 % ( 753.35 ) columns
336.0 	people skiped  62 % ( 765.7 ) columns
530.0 	people skiped  63 % ( 778.05 ) columns
932.0 	people skiped  64 % ( 790.4 ) columns
1700.0 	people skiped  65 % ( 802.75 ) columns
1983.0 	people skiped  66 % ( 815.1 ) columns
1675.0 	people skiped  67 % ( 827.45 ) columns
1466.0 	people skiped  68 % ( 839.8 ) columns
1222.0 	people skiped  69 % ( 852.15 ) columns
1226.0 	people skiped  70 % ( 864.5 ) columns
1361.0 	people skiped  71 % ( 876.85 ) columns
1240.0 	people skiped  72 % ( 889.2 ) columns
962.0 	people skiped  73 % ( 901.55 ) columns
1185.0 	people skiped  74 % ( 913.9 ) columns
1207.0 	people skiped  75 % ( 926.25 ) columns
1186.0 	people skiped  76 % ( 938.6 ) columns
1196.0 	people skiped  77 % ( 950.95 ) columns
43.0 	people skiped  78 % ( 963.3 ) columns
0.0 	people skiped  79 % ( 975.65 ) columns
0.0 	people skiped  80 % ( 988.0 ) columns
0.0 	people skiped  81 % ( 1000.35 ) columns
0.0 	people skiped  82 % ( 1012.7 ) columns
0.0 	people skiped  83 % ( 1025.05 ) columns
0.0 	people skiped  84 % ( 1037.4 ) columns
0.0 	people skiped  85 % ( 1049.75 ) columns
0.0 	people skiped  86 % ( 1062.1 ) columns
0.0 	people skiped  87 % ( 1074.45 ) columns
0.0 	people skiped  88 % ( 1086.8 ) columns
0.0 	people skiped  89 % ( 1099.15 ) columns
0.0 	people skiped  90 % ( 1111.5 ) columns
0.0 	people skiped  91 % ( 1123.85 ) columns
0.0 	people skiped  92 % ( 1136.2 ) columns
0.0 	people skiped  93 % ( 1148.55 ) columns
0.0 	people skiped  94 % ( 1160.9 ) columns
0.0 	people skiped  95 % ( 1173.25 ) columns
0.0 	people skiped  96 % ( 1185.6 ) columns
0.0 	people skiped  97 % ( 1197.95 ) columns
0.0 	people skiped  98 % ( 1210.3 ) columns
0.0 	people skiped  99 % ( 1222.65 ) columns

2.1. Survey Data Rows Fill Chart

As described in above cell, we are calculating rows missing percentages.For example row r1, r2, r3 has 1% missing values, then we are storing 3 against 1 percent. So to say, X number of rows has x% of missing values

In [6]:
plt.bar(range(len(y_axis)),y_axis)
plt.title('Missing Rows Spread in percentage')
plt.ylabel("No. Of People/Rows")
plt.xlabel("Missing Percentage")
Out[6]:
Text(0.5, 0, 'Missing Percentage')

2.2. Survey Data Rows Fill Chart [More Precise]

In [7]:
# more precise visualization of above graph
dd = pd.Index(rows_nulls_percentage)
p_c = (dd.value_counts(sort=False, ascending=True))
#print(p_c)


unique_percentages  = dd.unique()
unique_percentages.sort_values()

y_a = []
for u_per in unique_percentages:
    val = (dd[dd == u_per].value_counts())
    y_a.append(int(val))
    
y_a,unique_percentages
plt.bar(unique_percentages, y_a)
plt.title('Missing Row values Spread in percentage')
plt.ylabel("No. Of People/Rows")
plt.xlabel("Missing Percentage")
Out[7]:
Text(0.5, 0, 'Missing Percentage')
In [8]:
for index in range(len(unique_percentages)):
    print(y_a[index],"\tPeople skipped ",unique_percentages[index],"\b% fields")
1186 	People skipped  76 % fields
1222 	People skipped  69 % fields
1983 	People skipped  66 % fields
1361 	People skipped  71 % fields
1207 	People skipped  75 % fields
1196 	People skipped  77 % fields
530 	People skipped  63 % fields
1675 	People skipped  67 % fields
1466 	People skipped  68 % fields
962 	People skipped  73 % fields
1226 	People skipped  70 % fields
1700 	People skipped  65 % fields
1240 	People skipped  72 % fields
932 	People skipped  64 % fields
1185 	People skipped  74 % fields
49 	People skipped  58 % fields
126 	People skipped  60 % fields
58 	People skipped  59 % fields
336 	People skipped  62 % fields
17 	People skipped  53 % fields
30 	People skipped  57 % fields
204 	People skipped  61 % fields
13 	People skipped  52 % fields
43 	People skipped  78 % fields
16 	People skipped  56 % fields
23 	People skipped  55 % fields
22 	People skipped  54 % fields
8 	People skipped  50 % fields
1 	People skipped  43 % fields
1 	People skipped  47 % fields
2 	People skipped  48 % fields
2 	People skipped  49 % fields
1 	People skipped  46 % fields
4 	People skipped  51 % fields

2.3. Rows Removing/Droping

In the below cell indeces of the rows that doesn't match our filter criteria are calculated and then rows are dropped.

In [9]:
survey_data=pd.read_csv("./data/data.csv",low_memory=False)


rows_indeces_to_remove = []
for index, perc in enumerate(rows_nulls_percentage):
    if(perc>row_remove_threshold):
        rows_indeces_to_remove.append(index)


print(len(rows_indeces_to_remove),"rows will be removed with threshold",row_remove_threshold,"\b%")
#print(rows_nulls_percentage)

survey_data.drop(survey_data.index[rows_indeces_to_remove], inplace=True)
8380 rows will be removed with threshold 70 %

2.4. Shape of data set after removing rows

In [10]:
survey_data.shape
Out[10]:
(11647, 1235)

3. Data Visualization and Cleaning [Column wise]

Note: This cell executes with a waring and take about a minute to execute.

Every graph contains detals of missing cells for 50 colums(limitation of graph library). Black horizotal marker indicated a filled value otherwise a white space represents missing value.

In [11]:
cols = survey_data.columns
cols_length = len(cols)
for index in range(0,cols_length,chunk_size):
    set = survey_data[cols[index:index+chunk_size]]
    msno.matrix(set, inline=False)
/opt/anaconda3/lib/python3.7/site-packages/missingno/missingno.py:50: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  plt.figure(figsize=figsize)

B. Issue: Dictionary mismatch

data.csv hay 1235 columns but we have details of 1105 columns in dictionary. In the cell below, columns that are in our data.csv (UNCLEANED), but not dictionary. We didn't remove them just becuase they have missing, we handled missing columns nicely but after dropping columns those do not match our criteria.

In [12]:
cols_in_data = survey_data.columns
print("Columns in Data.csv", len(cols_in_data))

cols_in_dic = len(data_dictionaary)
print("Columns in Dictionary.xlsx", cols_in_dic)

columns_in_dictionary = list(data_dictionaary['Column Name'])


missing_cols_in_dic = []
for col in cols_in_data:
    if col not in columns_in_dictionary:
        #print(col)
        missing_cols_in_dic.append(str(col))
        
print("Following are ",len(missing_cols_in_dic)," columns for which there's no information in dictionary:\n\n",missing_cols_in_dic)
#columns_in_dictionary
Columns in Data.csv 1235
Columns in Dictionary.xlsx 1105
Following are  160  columns for which there's no information in dictionary:

 ['AA4', 'AA7', 'AA14', 'AA15', 'Unnamed: 9', 'DG3A_OTHERS', 'DG4_OTHERS', 'DG10b', 'DG10c', 'DG11b', 'DG11c', 'DG13_OTHERS', 'DG14_OTHERS', 'DL1_OTHERS', 'DL2_23_OTHERS', 'DL2_96_OTHERS', 'DL4_96', 'DL4_99', 'DL4_OTHERS', 'DL12_OTHERS', 'DL14', 'DL28_OTHERS', 'G2P1_OTHERS', 'G2P2_96', 'G2P2_1_OTHERS', 'G2P2_2_OTHERS', 'G2P2_4_OTHERS', 'G2P2_6_OTHERS', 'G2P2_7_OTHERS', 'G2P2_8_OTHERS', 'G2P2_9_OTHERS', 'G2P2_10_OTHERS', 'G2P2_11_OTHERS', 'G2P2_12_OTHERS', 'G2P2_13_OTHERS', 'G2P2_14_OTHERS', 'G2P2_15_OTHERS', 'G2P2_16_OTHERS', 'G2P2_96_OTHERS', 'G2P3_1', 'G2P3_2', 'G2P3_3', 'G2P3_4', 'G2P3_5', 'G2P3_6', 'G2P3_7', 'G2P3_8', 'G2P3_9', 'G2P3_10', 'G2P3_11', 'G2P3_12', 'G2P3_13', 'G2P3_14', 'G2P3_15', 'G2P3_16', 'G2P3_96', 'MT3_1', 'MT3_2', 'MT3_3', 'MT5_OTHERS', 'MT6_OTHERS', 'MT6A_OTHERS', 'MT6B_OTHERS', 'MT7A_OTHERS', 'MT9_OTHERS', 'MT12_1', 'MT12_2', 'MT12_3', 'MT12_4', 'MT12_5', 'MT12_6', 'MT12_7', 'MT12_8', 'MT12_9', 'MT12_10', 'MT12_11', 'MT12_12', 'MT12_13', 'MT12_14', 'MT12_96', 'MT12_99', 'MT12_OTHERS', 'MT13_1_OTHERS', 'MT13_2_OTHERS', 'MT13_3_OTHERS', 'MT13_4_OTHERS', 'MT13_7_OTHERS', 'MT13_9_OTHERS', 'MT13_11_OTHERS', 'MT13_96_OTHERS', 'MT14_3_OTHERS', 'MT14_5_OTHERS', 'MT14_7_OTHERS', 'MT14B_OTHERS', 'MT16_OTHERS', 'MT18_OTHERS', 'FF2A_OTHERS', 'FF3_OTHERS', 'FF7_OTHERS', 'FF10_OTHERS', 'FF14_OTHERS', 'MM2_OTHERS', 'MM11_2_OTHERS', 'MM11_5_OTHERS', 'MM11_11_OTHERS', 'MM11_13_OTHERS', 'MM12_OTHERS', 'MM12_REC', 'MM13_OTHERS', 'MM13_REC', 'MM14_OTHERS', 'MM15_OTHERS', 'MM24_OTHERS', 'MM28_OTHERS', 'MM36_OTHERS', 'MM37_OTHERS', 'MM38_OTHERS', 'MM41_OTHERS', 'MMP1_OTHERS', 'IFI10_OTHERS', 'IFI11_OTHERS', 'IFI12_OTHERS', 'IFI16_1_OTHERS', 'IFI16_2_OTHERS', 'IFI16_3_OTHERS', 'IFI16_6_OTHERS', 'IFI16_7_OTHERS', 'IFI21_OTHERS', 'IFI24_OTHERS', 'FL3_OTHERS', 'FL4_OTHERS', 'FL9A_OTHERS', 'FL9B_OTHERS', 'FL10_OTHERS', 'FB4_OTHERS', 'FB16_5_OTHERS', 'FB16_96_OTHERS', 'FB19_96_OTHERS', 'FB19_6_OTHERS', 'FB19A_OTHERS', 'FB19B_OTHERS', 'FB22_7_OTHERS', 'FB22_96_OTHERS', 'FB24_OTHERS', 'FB25_OTHERS', 'FB26_OTHERS', 'FB27_OTHERS', 'FB28_1_OTHERS', 'FB28_2_OTHERS', 'FB28_3_OTHERS', 'FB28_4_OTHERS', 'FB28_96_OTHERS', 'FB29_OTHERS', 'LN2_RIndLngBEOth', 'LN2_WIndLngBEOth', 'GN1_OTHERS', 'GN2_OTHERS', 'GN3_OTHERS', 'GN4_OTHERS', 'GN5_OTHERS']

3.1. In the following cell, columns those don't match our criteria will be dropped

In [13]:
all_columns = survey_data.columns
data_size_N = survey_data.shape[0]
cols_with_missing_values_count = 0
complete_col_count = 0
null_col_count = 0
should_drop_count = 0


columns_to_drop = []

#count the columns where even 1 row is missing
for col in all_columns:
    current_col = survey_data[col]
    missing_values_count = current_col.isnull().sum()
    non_empty_values_percentage = ((data_size_N - missing_values_count)/data_size_N)*100
    if(missing_values_count==0):
        complete_col_count+=1
    if(missing_values_count==data_size_N):
        null_col_count+=1
    if non_empty_values_percentage <  drop_threshold:
        #print(non_empty_values_percentage, data_size_N,missing_values_count, col)
        should_drop_count += 1
        columns_to_drop.append(col)
        #print(col)
    if missing_values_count > 0:
        cols_with_missing_values_count += 1
        #print('Column ',col, ' has',missing_values_count,' missing values' )
        
remaining_columns_count = len(all_columns)-should_drop_count

print("Total ",cols_with_missing_values_count," columns have missing values not matching our criteria")
print("Total ",should_drop_count,"columns should be dropped with threshold",drop_threshold,"\b%")

print("Total columns with all filled values", complete_col_count, "\nTotal columns with all null values", null_col_count, "\nNumber of rows in data", data_size_N, "\nColumns will remain",remaining_columns_count )

len(columns_to_drop)
Total  997  columns have missing values not matching our criteria
Total  849 columns should be dropped with threshold 60 %
Total columns with all filled values 238 
Total columns with all null values 55 
Number of rows in data 11647 
Columns will remain 386
Out[13]:
849

3.2. Shape of Dataset after Removing Rows

In [14]:
data_filtered = survey_data.drop(columns_to_drop, axis=1)
data_filtered.shape
Out[14]:
(11647, 386)

3.3. Graph of missing values after dropping rows and columns

It can be seen we are left with good enough data to do gender prediciton. If we didn't drop the rows and columns and fill the columsn with filling method, we might have introduced bias in our data, which is not good for features building for gender prediction model.

In [15]:
cols = data_filtered.columns
cols_length = len(cols)
for index in range(0,cols_length,chunk_size):
    set = data_filtered[cols[index:index+chunk_size]]
    msno.matrix(set, inline=False)
In [16]:
def drawBarChartOfTypes(data_filtered, title):
    #print(data_filtered.applymap(type))
    columns_types = data_filtered.dtypes.unique()
    columns_types_counts = data_filtered.dtypes.value_counts()
    barchart_x = []
    barchart_y = []

    for  inedex, col_type_count in  enumerate(columns_types_counts):
        barchart_x.append(str(columns_types[inedex]))
        barchart_y.append(col_type_count)
        #print(columns_types[inedex], col_type_count)

    bars = plt.bar(barchart_x, barchart_y)
    plt.title(title+" , Total Columns: "+str(remaining_columns_count))
    
    
    for bar in bars:
        yval = bar.get_height()
        plt.text(bar.get_x(), yval + .05, yval)

4. Counts of Columns Datatypes

Counts of all unique data types for columns in data

In [17]:
drawBarChartOfTypes(data_filtered, 'Columns Types Counts')

5. Missing Columns in dictionary

Following cell prints the columns that are in our filtered data but we don't have any information about them in dictionary.

In [18]:
#these are not ALL missing columns but those we need but not in dictionary
columns_in_dictionary = list(data_dictionaary['Column Name'])
filtered_columns = data_filtered.columns
missing_columns_in_dictionary = []

for filtered_column in filtered_columns:
    #print(filtered_column)
    if filtered_column not in columns_in_dictionary:
        missing_columns_in_dictionary.append(filtered_column)
(missing_columns_in_dictionary)
#columns_in_dictionary
Out[18]:
['AA4',
 'AA7',
 'AA14',
 'AA15',
 'DL4_96',
 'DL4_99',
 'DL14',
 'MT3_1',
 'MT3_2',
 'MT3_3',
 'LN2_RIndLngBEOth',
 'LN2_WIndLngBEOth']
In [19]:
data_filtered_corr_matrix = data_filtered.corr()
In [20]:
#ax = sns.heatmap(data_filtered_corr_matrix, annot=True, cmap="YlGnBu")

6. Data Type conversion

After calcuating correlation matrix, we noticed that size of columns correlation matrix is not equal to size of columns of filtered_data. After analyzing, we found that we need to handle object data types to include in correlation matrix.

Even the columns has type object they still contains numeric values except two obj columns LN2_RIndLngBEOth, LN2_WIndLngBEOth, we decided to drop them, becuase these columns are location information and has nothing to do with gender. More justification in report.

In the column FB16_6 we found an garbage value 'Gujraati' at index [10619], as per dictionary this column is supposed to have only numeric values. So assuming the data might not be filled nicely for whole row we removed the row.

In [21]:
#warning! this column included delteions, don't execute this column again and again

missing_correlation_columns = []
for filtered_column in filtered_columns:
    if filtered_column not in data_filtered_corr_matrix:
        missing_correlation_columns.append(filtered_column)

#we identified missing colums in the correlation and found that correlation matrix does not include for object types
print("Missing Columns in Correlation Matrix", missing_correlation_columns) #actually these are columsn with type object, we need to convert them to numeric
print("Object type columns",list(data_filtered.select_dtypes(['object']).columns))



#if we execute the data type conversion satement below, it throws an error, becuase two columns named 'LN2_RIndLngBEOth', 'LN2_WIndLngBEOth' are nominal, we don't have information about these columns in dictionary, so we decided to drop them from data
unknown_cols_to_drop = ['LN2_RIndLngBEOth', 'LN2_WIndLngBEOth']
for uk_col in unknown_cols_to_drop:
    if uk_col in data_filtered:
        data_filtered = data_filtered.drop([uk_col], axis=1)
    if uk_col in missing_correlation_columns:
        missing_correlation_columns.remove(uk_col)



print("Change type of", missing_correlation_columns)


#while executing the conversion statment we get following error
#'Unable to parse string "Gujarati" at position 10619', 'occurred at index FB16_6'
#after analyzing from dictionary we got to know this value not from an acceptable value domain. so we decided to remove the row

#if data_filtered.FB16_6[10619] == "Gujarati": #this can be thougt of garbage value
data_filtered.drop(data_filtered.index[10619], inplace=True)
print(data_filtered.shape)
data_filtered[missing_correlation_columns] = data_filtered[missing_correlation_columns].apply(pd.to_numeric)
Missing Columns in Correlation Matrix ['FB16_5', 'FB16_6', 'LN2_RIndLngBEOth', 'LN2_WIndLngBEOth']
Object type columns ['FB16_5', 'FB16_6', 'LN2_RIndLngBEOth', 'LN2_WIndLngBEOth']
Change type of ['FB16_5', 'FB16_6']
(11646, 384)

6.1. Post conversion counts of columns data types

Counts of all unique data types for columns in data after data type conversion

In [22]:
drawBarChartOfTypes(data_filtered, "Columns Types Count after changing data types Obj=> Numeric")

6.2. Recomputation of Correlation Matrix

This time size of columns in correlation matrix matches with size of columns in data

In [23]:
#recompute correlation matrix
data_filtered_corr_matrix = data_filtered.corr()
print("Columns in correlation matrix", len(data_filtered_corr_matrix), "\nColumns in data", data_filtered.shape[1])
Columns in correlation matrix 384 
Columns in data 384
In [24]:
def drawCorrelationChart(col_name, data, title):
    
    #ax = sns.heatmap(data, annot=True, cmap="YlGnBu")
    #data = list(data)
    y_axis = list(data[col_name].values)
    #x_axis = list(data.columns)
    #print(y_axis)
    plt.bar(range(0,len(y_axis)), y_axis)
    plt.title(title)
    plt.xlabel('Column Number')
    plt.ylabel('Correlation')
    
    
    

6.3. Columns Correlation with Gender

As per project requirement our task is to predict gender. For further data cleaning we want know the columns correlation with gender.

In [25]:
drawCorrelationChart("Gender",data_filtered_corr_matrix, "Correlation Before Filling")

6.4. Correlated Columns with threshold

For now, we decided to not drop the columns with low or negative corelation because we expect our Gender predictor [Next deliverable] to handle negative weights as well. So we are just writing a method to filter columns with threhold correlation, jus in case if we need them for future.

In [26]:
def getCorrelatedColumns(corr_matrix, threshold):
    col_corr = []
    gender_correlation = corr_matrix['Gender']
    columns = corr_matrix.columns
    correlated_columns = [ columns[index] for index, cor in enumerate(gender_correlation) if cor > threshold]

    return correlated_columns
In [27]:
print(getCorrelatedColumns(data_filtered_corr_matrix, -0.3))
['Gender', 'AA3', 'AA4', 'AA6', 'AA7', 'AA14', 'AA15', 'DG1', 'DG3', 'DG3A', 'DG4', 'DG5_1', 'DG5_2', 'DG5_3', 'DG5_4', 'DG5_5', 'DG5_6', 'DG5_7', 'DG5_8', 'DG5_9', 'DG5_10', 'DG5_11', 'DG6', 'DG8a', 'DG8b', 'DG8c', 'DG9a', 'DL0', 'DL1', 'DL4_1', 'DL4_2', 'DL4_3', 'DL4_4', 'DL4_5', 'DL4_6', 'DL4_7', 'DL4_8', 'DL4_9', 'DL4_10', 'DL4_11', 'DL4_12', 'DL4_13', 'DL4_14', 'DL4_15', 'DL4_16', 'DL4_17', 'DL4_18', 'DL4_19', 'DL4_20', 'DL4_21', 'DL4_22', 'DL4_23', 'DL4_96', 'DL4_99', 'DL5', 'DL6', 'DL11', 'DL14', 'DL15', 'DL16', 'DL17', 'DL18', 'DL19', 'DL20', 'DL21', 'DL22', 'DL23', 'DL24', 'DL25_1', 'DL25_2', 'DL25_3', 'DL25_4', 'DL25_5', 'DL25_6', 'DL25_7', 'DL25_8', 'DL26_1', 'DL26_2', 'DL26_3', 'DL26_4', 'DL26_5', 'DL26_6', 'DL26_7', 'DL26_8', 'DL26_9', 'DL26_10', 'DL26_12', 'DL26_99', 'MT1', 'MT1A', 'MT2', 'MT3_1', 'MT3_2', 'MT3_3', 'MT4_1', 'MT4_2', 'MT4_3', 'MT4_4', 'MT4_5', 'MT4_6', 'MT5', 'MT6', 'MT6A', 'MT6B', 'MT6C', 'MT10', 'MT17_1', 'MT17_2', 'MT17_3', 'MT17_4', 'MT17_5', 'MT17_6', 'MT17_7', 'MT17_8', 'MT17_9', 'MT17_10', 'MT17_11', 'MT17_12', 'MT17_13', 'MT18_1', 'MT18_2', 'MT18_3', 'MT18_4', 'MT18_5', 'MT18_6', 'MT18_96', 'MT18_8', 'FF1', 'FF2', 'FF2A', 'FF5', 'FF6_1', 'FF6_2', 'FF6_3', 'FF6_4', 'FF6_5', 'FF6_6', 'FF6_7', 'FF6_8', 'FF6_9', 'FF6_10', 'FF7_1', 'FF7_2', 'FF7_3', 'FF7_4', 'FF7_5', 'FF7_6', 'FF7_7', 'FF7_96', 'FF9', 'FF10_1', 'FF10_2', 'FF10_3', 'FF10_4', 'FF10_5', 'FF10_6', 'FF10_96', 'FF13', 'FF14_1', 'FF14_2', 'FF14_3', 'FF14_4', 'FF14_5', 'FF14_6', 'FF14_7', 'FF14_8', 'FF14_9', 'FF14_10', 'FF14_11', 'FF14_12', 'FF14_13', 'FF14_14', 'FF14_15', 'FF14_16', 'FF14_17', 'FF14_18', 'FF14_19', 'FF14_20', 'FF14_21', 'FF14_22', 'FF14_23', 'FF14_96', 'FF16_1', 'FF16_2', 'FF19_1', 'FF19_2', 'FF19_3', 'FF19_4', 'FF19_5', 'FF19_6', 'FF19_7', 'FF19_8', 'MM1', 'MM2_1', 'MM2_2', 'MM2_3', 'MM2_4', 'MM2_5', 'MM2_6', 'MM2_7', 'MM2_8', 'MM2_9', 'MM2_10', 'MM2_11', 'MM2_12', 'MM2_13', 'MM2_14', 'MM2_15', 'MM3_1', 'MM3_2', 'MM3_3', 'MM3_4', 'MM3_5', 'MM3_6', 'MM3_7', 'MM3_8', 'MM3_9', 'MM3_10', 'MM3_11', 'MM3_12', 'MM3_13', 'MM3_14', 'MMP1_1', 'MMP1_2', 'MMP1_3', 'MMP1_4', 'MMP1_5', 'MMP1_6', 'MMP1_7', 'MMP1_8', 'MMP1_9', 'MMP1_10', 'MMP1_11', 'MMP1_96', 'IFI1_1', 'IFI3_1', 'IFI1_2', 'IFI3_2', 'IFI1_3', 'IFI3_3', 'IFI1_4', 'IFI1_5', 'IFI1_6', 'IFI1_7', 'IFI1_8', 'IFI1_9', 'IFI14_1', 'IFI14_2', 'IFI14_3', 'IFI14_4', 'IFI14_5', 'IFI14_6', 'IFI14_7', 'IFI15_1', 'IFI15_2', 'IFI15_3', 'IFI15_4', 'IFI15_5', 'IFI15_6', 'IFI15_7', 'IFI16_1', 'IFI16_2', 'IFI17_1', 'IFI17_2', 'IFI18', 'FL1', 'FL2', 'FL3', 'FL4', 'FL6_1', 'FL6_2', 'FL6_3', 'FL6_4', 'FL7_1', 'FL7_2', 'FL7_3', 'FL7_4', 'FL7_5', 'FL7_6', 'FL8_1', 'FL8_2', 'FL8_3', 'FL8_4', 'FL8_5', 'FL8_6', 'FL8_7', 'FL9A', 'FL9B', 'FL9C', 'FL10', 'FL11', 'FL12', 'FL13', 'FL14', 'FL15', 'FL16', 'FL17', 'FL18', 'FB1_1', 'FB1_2', 'FB1_3', 'FB2', 'FB3', 'FB13', 'FB16_1', 'FB16_2', 'FB16_3', 'FB16_4', 'FB16_5', 'FB16_6', 'FB16_7', 'FB16_8', 'FB16_96', 'FB18', 'FB19', 'FB19A_1', 'FB19A_2', 'FB19A_3', 'FB19A_4', 'FB19A_5', 'FB19A_96', 'FB19B_1', 'FB19B_2', 'FB19B_3', 'FB19B_4', 'FB19B_5', 'FB19B_96', 'FB20', 'FB22_1', 'FB22_2', 'FB22_3', 'FB22_4', 'FB22_5', 'FB22_6', 'FB22_7', 'FB22_8', 'FB22_9', 'FB22_10', 'FB22_11', 'FB22_12', 'FB22_96', 'FB26_1', 'FB26_2', 'FB26_3', 'FB26_4', 'FB26_5', 'FB26_6', 'FB26_7', 'FB26_8', 'FB26_9', 'FB26_10', 'FB26_11', 'FB26_96', 'FB26_99', 'FB27_1', 'FB27_2', 'FB27_3', 'FB27_4', 'FB27_5', 'FB27_6', 'FB27_7', 'FB27_8', 'FB27_9', 'FB27_96', 'FB29_1', 'FB29_2', 'FB29_3', 'FB29_4', 'FB29_5', 'FB29_6', 'FB29_96', 'LN1A', 'LN1B', 'LN2_1', 'LN2_2', 'LN2_3', 'LN2_4', 'GN1', 'GN2', 'GN3', 'GN4', 'GN5']

7. Dictionary Filtering

We seperated the dictionary containg information of all columns we have in filtered data, given the columns information is available in full dictionary.

In [28]:
#seperating the dictionary, keeping only necessary columns dictionary
data_dictionaary_filtered = pd.DataFrame(data_dictionaary) #initiazing with same but will remove unnecessary in following loop


rows_indexes_to_delete = []
for index in range(len(data_dictionaary)):
    col_name = list(data_dictionaary_filtered['Column Name'])[index] #data_dictionaary.loc[ index , : ][0]
    if col_name not in data_filtered.columns:
        rows_indexes_to_delete.append(index)

        

print(len(rows_indexes_to_delete) ,len(data_filtered.columns))
data_dictionaary_filtered.drop(data_dictionaary_filtered.index[rows_indexes_to_delete], inplace=True)
len(data_dictionaary_filtered), data_dictionaary_filtered
    
    
731 384
Out[28]:
(374,      Column Name                                           Question  \
 0         Gender         DG2. Is the respondent a male or a female?   
 1            AA3                                          AA3. Zone   
 3            AA6                                 AA6. Village Class   
 10           DG3                  DG3. What is your marital status?   
 11          DG3A                             What is your religion?   
 ...          ...                                                ...   
 1086       FF7_6  FF7.6.How many of the following type of bank a...   
 1087       FF7_7  FF7.7.How many of the following type of bank a...   
 1088      FF7_96  FF7.96.How many of the following type of bank ...   
 1098       IFI18  IFI18.How many informal societies or group sav...   
 1099        FB13  FB13.How many times in the past 12 months have...   
 
                                                  Values  
 0                                      1=Male\n2=Female  
 1                      1=North\n2=East\n3=West\n4=South  
 3     6=Village Class 1\n7=Village Class 2\n8=Villag...  
 10    1=Singe/ Not married \n2=Polygamously married ...  
 11    1=Christianity\n2=Islam\n3=Sikhism\n4=Hinduism...  
 ...                                                 ...  
 1086                                         N/A\n99=DK  
 1087                                         N/A\n99=DK  
 1088                                         N/A\n99=DK  
 1098                                         N/A\n99=DK  
 1099                                         N/A\n99=DK  
 
 [374 rows x 3 columns])

C. Exporting relevant dictionary to file

In [29]:
data_dictionaary_filtered.to_excel (r'./data/filtered_dictionary.xlsx', engine='xlsxwriter')

8. Data Filling

We are doing filling in two steps:

  1. Filling with default values: We noticed that there are few columns we can fill with their default values instead of filling with mean or mode so to introduce bias. So we filled them with defaulat values which is described as DK(99) in dictionary. When we do gender prediction, we might consider replacing 99 with other number may be median value. But for now we are filling with 99 as per dictionary.

  2. Filling values with mode: For now we are filling the missing values with Mode, which mean considering the opinion of most of the people. We might revisit this method and consider filling with median instead of mode. i-e, 3 for domain values of [1,2,3,4,5]

In [30]:
data_dictionaary_filtered
Out[30]:
Column Name Question Values
0 Gender DG2. Is the respondent a male or a female? 1=Male\n2=Female
1 AA3 AA3. Zone 1=North\n2=East\n3=West\n4=South
3 AA6 AA6. Village Class 6=Village Class 1\n7=Village Class 2\n8=Villag...
10 DG3 DG3. What is your marital status? 1=Singe/ Not married \n2=Polygamously married ...
11 DG3A What is your religion? 1=Christianity\n2=Islam\n3=Sikhism\n4=Hinduism...
... ... ... ...
1086 FF7_6 FF7.6.How many of the following type of bank a... N/A\n99=DK
1087 FF7_7 FF7.7.How many of the following type of bank a... N/A\n99=DK
1088 FF7_96 FF7.96.How many of the following type of bank ... N/A\n99=DK
1098 IFI18 IFI18.How many informal societies or group sav... N/A\n99=DK
1099 FB13 FB13.How many times in the past 12 months have... N/A\n99=DK

374 rows × 3 columns

8.1. Identifying the column we can fill with default domain value

In [31]:
columns_to_be_filled_with_dk = []
for index in range(len(data_dictionaary_filtered)):
    domain_values = list(data_dictionaary_filtered['Values'])[index]#data_dictionaary_filtered.loc[ index , : ][2]
    col_name = list(data_dictionaary_filtered['Column Name'])[index]#data_dictionaary.loc[ index , : ][0]
    if '99=DK' in domain_values:
        #print(domain_values,"\n")
        #print(col_name)
        columns_to_be_filled_with_dk.append(col_name)
    
In [32]:
print("Following",len(columns_to_be_filled_with_dk),"can be filled with default value",columns_to_be_filled_with_dk)
Following 85 can be filled with default value ['DG3', 'DG3A', 'DG4', 'DG6', 'DL1', 'DL5', 'DL24', 'MT1A', 'MT5', 'MT6', 'MT6A', 'MT6B', 'FF6_1', 'FF6_2', 'FF6_3', 'FF6_4', 'FF6_5', 'FF6_6', 'FF6_7', 'FF6_8', 'FF6_9', 'FF6_10', 'IFI14_1', 'IFI14_2', 'IFI14_3', 'IFI14_4', 'IFI14_5', 'IFI14_6', 'IFI14_7', 'IFI15_1', 'IFI15_2', 'IFI15_3', 'IFI15_4', 'IFI15_5', 'IFI15_6', 'IFI15_7', 'IFI16_1', 'IFI16_2', 'IFI17_1', 'IFI17_2', 'FL4', 'FL7_1', 'FL7_2', 'FL7_3', 'FL7_4', 'FL7_5', 'FL7_6', 'FL10', 'FL11', 'FL12', 'FL13', 'FL14', 'FL15', 'FL16', 'FL17', 'FL18', 'FB19B_1', 'FB19B_2', 'FB19B_3', 'FB19B_4', 'FB19B_5', 'FB19B_96', 'GN1', 'GN2', 'GN3', 'GN4', 'GN5', 'DG1', 'DG8a', 'DG8b', 'DG8c', 'DG9a', 'DL11', 'MT1', 'MT6C', 'FF7_1', 'FF7_2', 'FF7_3', 'FF7_4', 'FF7_5', 'FF7_6', 'FF7_7', 'FF7_96', 'IFI18', 'FB13']

8.2. Filling with default value

For now we are considering 99 as per dictionary, we might consider replacing 99 with median value of column domain values.

In [33]:
for col_name in columns_to_be_filled_with_dk:
    #print(col_name)
    data_filtered[col_name] = data_filtered[col_name].replace(np.nan, 99)

8.3. Filling with mode value

For now we are considering mode value by keeping the opinion/answer of the most of the people. We might consider replacing it with median of domain values.

In [34]:
#Rest to be filled with mode 
for column in data_filtered.columns:
    data_filtered[column].fillna(data_filtered[column].mode()[0], inplace=True)

8.4. Post Filling Correlation Matrix Visualization

Visualizatin of correlation with gender after filling. We can see there's not much bias after filling the data

In [35]:
data_filtered_corr_matrix_fil = data_filtered.corr()
drawCorrelationChart("Gender",data_filtered_corr_matrix, "Correlation Before Filling")
plt.figure()
drawCorrelationChart("Gender",data_filtered_corr_matrix_fil, "Correlation After Filling")

8.5. Post filling visualzation of missing values

No value in the dat left unfilled

In [36]:
cols = data_filtered.columns
cols_length = len(cols)
for index in range(0,cols_length,chunk_size):
    set = data_filtered[cols[index:index+chunk_size]]
    msno.matrix(set, inline=False)

D. Storing the Cleaned/Preprocessed data in a sperate file

In [37]:
data_filtered.to_csv (r'./data/filtered_data.csv', index = False, header=True)

Deliverable 2

In [38]:
data_filtered=pd.read_csv("./data/filtered_data.csv",low_memory=False)
data_for_clustering = data_filtered.drop(['Gender'], axis=1)

Train and Test Split (Stratified)

Spliting data into Train and Test with 80/20 ratio.

Note :Test and Split is only for Classification, we are using whole data for Clustring

In [ ]:
 
In [39]:
from sklearn.model_selection import train_test_split 
train_data, test_data = train_test_split(data_filtered, test_size=0.20, stratify=data_filtered['Gender']) 


y_train = train_data.Gender
y_test = test_data.Gender

X_test = test_data.drop('Gender', axis = 1)
X_train = train_data.drop('Gender', axis = 1)

X_all = data_for_clustering
y_all = data_filtered.Gender


len(X_train),len(X_test), len(X_all), len(y_all)
Out[39]:
(9316, 2330, 11646, 11646)
In [40]:
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
In [ ]:
 
In [41]:
def printStats(algoResults):
    
    print('\nAlgorithm\t|TPs\t|TNs\t|FPs\t|FNs\t\t|F1-Score\t|Accuracy\t|Precision\t|ROC/AUC','\n'+'================================================================================================================')
    for algoResults_i in algoResults:
        confusionMatrix = algoResults_i[1]
        algoName = algoResults_i[0]
        
        
        tp = int(confusionMatrix[0][0])
        fp = int(confusionMatrix[0][1])
        fn = int(confusionMatrix[1][0])
        tn = int(confusionMatrix[1][1])


        precision = tp/(tp+fp)
        recall = tp/(tp+fn)
        accuracy = (tp+tn)/(tp+tn+fp+fn)
        f1Score = 2*((precision * recall) / (precision + recall))
        roc_auc = algoResults_i[2]
        
        dec_fig = 2
        precision= round(precision, dec_fig)
        recall= round(recall, dec_fig)
        accuracy= round(accuracy,dec_fig )
        f1Score= round(f1Score, dec_fig)
        roc_auc= round(roc_auc, dec_fig)


        precision= round(precision*100, dec_fig)
        recall= round(recall*100, dec_fig)
        accuracy= round(accuracy*100,dec_fig )
        f1Score= round(f1Score*100, dec_fig)
        roc_auc= round(roc_auc*100, dec_fig)
        

        precision = str(precision)+"%"
        recall = str(recall)+"%"
        accuracy = str(accuracy)+"%"
        f1Score = str(f1Score)+"%"
        roc_auc = str(roc_auc)+"%"
        
        print(algoName+'\t|'+str(tp)+'\t|'+str(tn)+'\t|'+str(fp)+'\t|'+str(fn)+'\t\t|'+str(f1Score)+'    \t|'+str(accuracy)+'   \t|'+str(precision)+'    \t|'+str(roc_auc))
        #print('------------------------------------------------------------------------------------------------------------------')
    

    
    
In [42]:
def getConfusionMatix(actualLabels, predictedLabels):
    confusionMatrix = np.zeros([2,2])
    for index, label in enumerate(actualLabels):
        goldLabel  = int(label)
        predictedLabel = int(predictedLabels[index])
        #if not goldLabel == predictedLabel:
        #print('goldLabel', goldLabel, "predictedLabel",predictedLabel,'@',index)
        if goldLabel == predictedLabel:
            #print("Correct Prediction")
            if(goldLabel):
                confusionMatrix[0][0] = confusionMatrix[0][0] +1
                #print("TP for")
            else:
                confusionMatrix[1][1] = confusionMatrix[1][1] +1
                #print("TN for")
        else:
            if(goldLabel):
                confusionMatrix[1][0] = confusionMatrix[1][0] +1
            else:
                confusionMatrix[0][1] = confusionMatrix[0][1] +1

    return confusionMatrix
    
In [43]:
X_train_values = X_train.values
#X_train_values = StandardScaler().fit_transform(X_train_values) 
In [44]:
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Task 1: Gender Prediction

Kmean Clustering

In [45]:
def get_Kmeans_clusters_labels(data, k, n_init):
    # Initialize
    k_means = KMeans(init="k-means++", n_clusters=k, n_init=n_init)
    k_means.fit(data)
    return k_means.predict(data), k_means.inertia_
In [46]:
predicted_labels_kmeans, inertia = get_Kmeans_clusters_labels(X_all, 2, n_init=12)

Agglomerative Clustring

Executing twice with two differnt linkages

  • Complete/Max
  • Ward
In [47]:
ac = AgglomerativeClustering(n_clusters =2, affinity = 'euclidean', linkage = 'complete')
predicted_labels_hc_complete = ac.fit_predict(X_all)

ac = AgglomerativeClustering(n_clusters =2, affinity = 'euclidean', linkage = 'ward')
predicted_labels_hc_ward = ac.fit_predict(X_all)

DBScan Clustring

Settings for DBScan

  • eps(radius) = 2
  • n_eps = 5
In [48]:
dbscan_clustering = DBSCAN(eps=300, min_samples=100)
predicted_labels_dbscan = dbscan_clustering.fit_predict(X_all)

Clustering Algorithms goodness measures

In [49]:
algoResults_clustering = []
#cm_rfc = getConfusionMatix(y_test,predictions )
#algoResults.append(['RFC\t', cm_rfc])

cm_keams = getConfusionMatix(y_test,predicted_labels_kmeans )
roc_score= roc_auc_score(y_all, predicted_labels_kmeans)
algoResults_clustering.append(['Kmeans\t', cm_keams, roc_score])


cm_ac_ward = getConfusionMatix(y_test,predicted_labels_hc_ward )
roc_score= roc_auc_score(y_all, predicted_labels_hc_ward)
algoResults_clustering.append(['Agglo-Ward', cm_ac_ward, roc_score])

cm_ac_complete = getConfusionMatix(y_test,predicted_labels_hc_complete )
roc_score= roc_auc_score(y_all, predicted_labels_hc_complete)
algoResults_clustering.append(['Agglo-Max', cm_ac_complete, roc_score])

cm_dbscan = getConfusionMatix(y_test,predicted_labels_dbscan )
roc_score= roc_auc_score(y_all, predicted_labels_dbscan)
algoResults_clustering.append(['DBScan\t', cm_dbscan, roc_score])


printStats(algoResults_clustering)
Algorithm	|TPs	|TNs	|FPs	|FNs		|F1-Score	|Accuracy	|Precision	|ROC/AUC 
================================================================================================================
Kmeans		|584	|619	|597	|530		|51.0%    	|52.0%   	|49.0%    	|55.0%
Agglo-Ward	|567	|632	|584	|547		|50.0%    	|51.0%   	|49.0%    	|54.0%
Agglo-Max	|300	|928	|288	|814		|35.0%    	|53.0%   	|51.0%    	|50.0%
DBScan		|19	|13	|1203	|1095		|2.0%    	|1.0%   	|2.0%    	|49.0%

Comments on Clustering results

Clustering algorithm didn't perfom well to predict the gender. We can try some classfication and regression algorithm to get better results for gender prediciton.

Prediction Improvements

We are using following three algorithms to improved gender prediction task.

  • Classfication:
    • KNN with k = {3, 7, 10}
    • Random Forest
  • Regression:
    • Logistic Regression
      • With two different loss functions (Ridge Penalty and Lasso(l2) Penalty)

K Nearest Neighbours

We are executing for 3 values of K

  • K= {3,5,7}
In [148]:
def executeKNN(X_test, y_train, K):
    knn_classifier = KNeighborsClassifier(n_neighbors=K)
    knn_classifier.fit(X_train, y_train)
    labels_knn  = knn_classifier.predict(X_test)
    return labels_knn

Feature Importance to extract most important variables

In [142]:
def get_feature_importance(model, features):
    if not hasattr(model, "coef_") and not hasattr(model, "feature_importances_"):
        raise Exception("Not possible to collect feature importances")

    if hasattr(model, "coef_"):
        model_feature_importances = model.coef_[0]
    elif hasattr(model, "feature_importances_"):
        model_feature_importances = model.feature_importances_

    return [
        (feature, importance)
        for feature, importance in sorted(zip(features, model_feature_importances),
            key=lambda pair: pair[1],
            reverse=True,
        )
    ]
In [143]:
def plotFeatureImportance(model, X_train, threshold= 0.00001):
    fi = get_feature_importance(model, X_train.columns.values)
    #print(fi)
    pd_fi = pd.DataFrame(fi, columns=["feature", "importance"])
    pd_fi = pd_fi[(pd_fi.importance >threshold )]
    #print(pd_fi)
    p = plt.barh(pd_fi["feature"], pd_fi["importance"])
    

Random Forest Classifier

In [ ]:
 
In [144]:
rfc = RandomForestClassifier(random_state=42, n_estimators=100, n_jobs=-1, criterion='entropy')
rfc.fit(X_train, y_train)
labels_rfc = rfc.predict(X_test)
plotFeatureImportance(rfc, X_train, threshold = 0.01)

Logistic Regression Classfier- Lasso Penalty(L2)

In [145]:
logReg = LogisticRegression(max_iter = 100, random_state=0,penalty= 'l2',solver='saga')
logReg.fit(X_train, y_train)
labels_logReg_l2 = logReg.predict(X_test)
plotFeatureImportance(logReg, X_train, )


    
/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/sag.py:337: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
In [ ]:
 

Logistic Regression Classfier- Ridge Penalty(L1)

In [146]:
#max_iterations are set high intentionally,becuase data has large number of columns.
#so, model should be given enough iterations to complete training process
logReg = LogisticRegression(max_iter = 100, random_state=0,penalty= 'l2',solver='saga')
logReg.fit(X_train, y_train)
labels_logReg_l1 = logReg.predict(X_test)
plotFeatureImportance(logReg, X_train)
/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/sag.py:337: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  "the coef_ did not converge", ConvergenceWarning)
In [151]:
classification_results = []

cm_rfc = getConfusionMatix(y_test,labels_rfc )
roc_score= roc_auc_score(y_test, labels_rfc)
classification_results.append(['RFC\t',cm_rfc, roc_score ])

cm_logReg_l1 = getConfusionMatix(y_test,labels_logReg_l1 )
roc_score= roc_auc_score(y_test, labels_logReg_l1)
classification_results.append(['LogReg-L1',cm_logReg_l1,roc_score ])

cm_logReg_l2 = getConfusionMatix(y_test,labels_logReg_l2 )
roc_score= roc_auc_score(y_test, labels_logReg_l2)
classification_results.append(['LogReg-L2',cm_logReg_l2,roc_score ])


k = [3, 7, 10]
for k_i in k:
    labels_knn =executeKNN(X_test, y_train, k_i)
    cm_knn = getConfusionMatix(y_test,labels_knn )
    roc_score= roc_auc_score(y_test, labels_knn)
    classification_results.append(['KNN-'+(str(k_i))+'\t', cm_knn, roc_score])

printStats(classification_results)
Algorithm	|TPs	|TNs	|FPs	|FNs		|F1-Score	|Accuracy	|Precision	|ROC/AUC 
================================================================================================================
RFC		|982	|1132	|84	|132		|90.0%    	|91.0%   	|92.0%    	|91.0%
LogReg-L1	|90	|1168	|48	|1024		|14.0%    	|54.0%   	|65.0%    	|52.0%
LogReg-L2	|90	|1168	|48	|1024		|14.0%    	|54.0%   	|65.0%    	|52.0%
KNN-3		|665	|888	|328	|449		|63.0%    	|67.0%   	|67.0%    	|66.0%
KNN-7		|646	|916	|300	|468		|63.0%    	|67.0%   	|68.0%    	|67.0%
KNN-10		|564	|987	|229	|550		|59.0%    	|67.0%   	|71.0%    	|66.0%

Task 2. Patterns Associated with gender

Just displaying the patterns assiciate with gender with columns identifided as important columns[highly weighted features: those play major role in prediction] above while training RFC, LogRegression models.

Patterns for features declared important by LogRegression- Ridge Penalty

In [153]:
table = pd.crosstab(index=data_filtered["MT6"], 
                          columns=data_filtered["Gender"])

table
Out[153]:
Gender 0 1
MT6
1.0 4444 941
2.0 69 1265
3.0 287 181
4.0 71 135
5.0 79 99
6.0 14 13
7.0 14 6
96.0 9 2
99.0 1093 2924
In [154]:
fig = plt.figure()

fig = table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)


plt.xlabel('MT6', fontsize=18)
plt.ylabel('counts', fontsize=16)
plt.title('MT6 with respect to Gender')
Out[154]:
Text(0.5, 1.0, 'MT6 with respect to Gender')
<Figure size 432x288 with 0 Axes>
In [155]:
"""
This variable shows a clear differentiation between the two genders. The males have bought phones by 
themselves most of the time. Some other person have been purchased it for females in many occasions.

"""
Out[155]:
'\nThis variable shows a clear differentiation between the two genders. The males have bought phones by \nthemselves most of the time. Some other person have been purchased it for females in many occasions.\n\n'
In [156]:
#MT6B - Where you charge the battery

ax = sns.catplot(x='MT6B',kind='count',data=data_filtered,orient="h", hue = 'Gender')
ax.fig.autofmt_xdate()
In [157]:
table = pd.crosstab(index=data_filtered["MT6B"], 
                          columns=data_filtered["Gender"])

table
Out[157]:
Gender 0 1
MT6B
1.0 4872 2566
2.0 101 99
3.0 5 4
4.0 26 7
5.0 8 6
6.0 2 3
99.0 1066 2881
In [158]:
"""
Most of the people have been charging it either from home or don't know exactly. It seems like most of the
males do have a exact place to charge their phone and for most of the females, they don't have a idea about that.
When looking at the detailes more closely, a very less number of people charge their phones at work. There are even 14 
people who charge their phones at retail shops.

"""
Out[158]:
"\nMost of the people have been charging it either from home or don't know exactly. It seems like most of the\nmales do have a exact place to charge their phone and for most of the females, they don't have a idea about that.\nWhen looking at the detailes more closely, a very less number of people charge their phones at work. There are even 14 \npeople who charge their phones at retail shops.\n\n"
In [159]:
#MT6A - How frequently charge your phone

ax = sns.catplot(x='MT6A',kind='count',data=data_filtered,orient="h", hue = 'Gender')
ax.fig.autofmt_xdate()
In [160]:
"""
People either charge their phones daily or don't exactly know in most of the times.
Most of the females don't know about charging the phone or they don't keep a track about it it seems.

"""
Out[160]:
"\nPeople either charge their phones daily or don't exactly know in most of the times.\nMost of the females don't know about charging the phone or they don't keep a track about it it seems.\n\n"
In [161]:
"""
When these two variables are combined, it can be seen that the people who charge their phones frequently are 
charging it at their homes while the people who don't know about charging don't know where it's been charging.
This explains the integrity of survey data too.
"""

table = pd.crosstab(index=data_filtered["MT6C"], 
                          columns=data_filtered["MT6A"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)
plt.xlabel('MT6C', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.title('MT6A with respect to MT6C')
Out[161]:
Text(0.5, 1.0, 'MT6A with respect to MT6C')
In [162]:
#MT5 - purpose to obtain a Mobile phone

table = pd.crosstab(index=data_filtered["MT5"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

plt.xlabel('MT5', fontsize=18)
plt.ylabel('counts', fontsize=16)
plt.title('MT5 with respect to Gender')

"""
Both males and females have been obtained the mobile phone for nearly same reasons.
There is no clear difference between the gender when it comes to the purpose of buying it.
"""
Out[162]:
'\nBoth males and females have been obtained the mobile phone for nearly same reasons.\nThere is no clear difference between the gender when it comes to the purpose of buying it.\n'
In [163]:
#MT6C -  cost each time to charge your phone’s

sns.catplot(x="Gender",y="MT6C",data=data_filtered)

"""
The cost of charge is a constant for any person. It does seems to vary much with the gender.
"""
Out[163]:
'\nThe cost of charge is a constant for any person. It does seems to vary much with the gender.\n'
In [164]:
#DG1 -  Birth year

sns.catplot(x="Gender",y="DG1",data=data_filtered, kind = 'box')

"""
Most of the people who participated the survey are born in the years from 1970 - 1990. There are fe people who
have born before 1940. They cannot considered as outliers since these are possible data points.
"""
Out[164]:
'\nMost of the people who participated the survey are born in the years from 1970 - 1990. There are fe people who\nhave born before 1940. They cannot considered as outliers since these are possible data points.\n'
In [165]:
#IFI17_2 - How easy to go to the nearest ATM

table = pd.crosstab(index=data_filtered["IFI17_2"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

plt.xlabel('IFI17_2', fontsize=18)
plt.ylabel('counts', fontsize=16)
plt.title('IFI17_2 with respect to Gender')
"""
THis variable does not seems to affect with the gender. The difficulty level is mostly similar 
for both genders
"""
Out[165]:
'\nTHis variable does not seems to affect with the gender. The difficulty level is mostly similar \nfor both genders\n'

Patterns for feature declared important by Random Forest Classifier

Few of the features are common. For distinct features patterns are displayed below.

In [168]:
#GN1 - who decides how the money you earn will be used
table = pd.crosstab(index=data_filtered["GN1"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

plt.xlabel('GN1', fontsize=18)
plt.ylabel('counts', fontsize=16)
plt.title('GN1 with respect to Gender')
"""
According this plot, it is clear that the most of males decide about spending my by themselves.
But for females its different. The decisions are taken by the spouse only for less number of males 
with compared to the females in the same category.
"""
Out[168]:
'\nAccording this plot, it is clear that the most of males decide about spending my by themselves.\nBut for females its different. The decisions are taken by the spouse only for less number of males \nwith compared to the females in the same category.\n'

Patterns for feature declared important by LogReg-Lasso Penalty

All but one feautes are common.

In [169]:
#DL0 - The main income owner of the household
table = pd.crosstab(index=data_filtered["DL0"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)
plt.xlabel('DL0', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.title('DL0 with respect to Gender')

"""
#Most of the people fall in to the categories 'myself' and 'somebody else' are males and females respectively.
#It implies that, males are the main source of income in the households most of the time. For females, 
#its somebody other than them

"""
Out[169]:
"\n#Most of the people fall in to the categories 'myself' and 'somebody else' are males and females respectively.\n#It implies that, males are the main source of income in the households most of the time. For females, \n#its somebody other than them\n\n"
In [170]:
#DG6 - relation with the household head

table = pd.crosstab(index=data_filtered["DG6"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)
plt.xlabel('DG6', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.title('DG6 with respect to Gender')

"""

Most of the peole answered the survey are the head of the household. Among them, the most are males.
Out of the people fall in to the second category (spouse), most of them are females.
This implies that the head of the most households are Males.
"""
Out[170]:
'\n\nMost of the peole answered the survey are the head of the household. Among them, the most are males.\nOut of the people fall in to the second category (spouse), most of them are females.\nThis implies that the head of the most households are Males.\n'
In [171]:
#DL1 - WORKING STATUS

table = pd.crosstab(index=data_filtered["DL1"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

plt.xlabel('DL1', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.title('DL1 with respect to Gender')

"""
Most of the people who were working full time for a regular salary are males while most of the people 
who stayed at the home are females. Most of the working peaple are also males.

This clearly shows that, males are working and earning money with compared to the females.

"""
Out[171]:
'\nMost of the people who were working full time for a regular salary are males while most of the people \nwho stayed at the home are females. Most of the working peaple are also males.\n\nThis clearly shows that, males are working and earning money with compared to the females.\n\n'
In [172]:
#MT1A - Decisions about having a phone

table = pd.crosstab(index=data_filtered["MT1A"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

plt.xlabel('MT1A', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.title('MT1A with respect to Gender')

"""
Most of the males have decide by themselves to have a phone whereas for females
it has been decided by her spouse most of the time. None of the sisters have decide it 
for any one whereas the parents have decided it for both males and females in neary equal proportions.

"""
Out[172]:
'\nMost of the males have decide by themselves to have a phone whereas for females\nit has been decided by her spouse most of the time. None of the sisters have decide it \nfor any one whereas the parents have decided it for both males and females in neary equal proportions.\n\n'
In [173]:
#FL4 - On who you do depend for most financial advices

ax = sns.catplot(x='FL4',kind='count',data=data_filtered,orient="h", hue = 'Gender')
ax.fig.autofmt_xdate()

"""
Most of the males depends on themselves when it comes to financial advices. Females depends ond their spouses.
Earlier it was found that the main source of income for most households are males. This might be the reason for 
females to depend on their spouse.
"""
Out[173]:
'\nMost of the males depends on themselves when it comes to financial advices. Females depends ond their spouses.\nEarlier it was found that the main source of income for most households are males. This might be the reason for \nfemales to depend on their spouse.\n'
In [174]:
#DG3 - On who you do depend for most financial advices

ax = sns.catplot(x='DG3',kind='count',data=data_filtered,orient="h", hue = 'Gender')
ax.fig.autofmt_xdate()

"""
The number of single men are higher with compared to the number of females. The number of Monogamously married 
men are also higher than the number of females. From the people in the category widow, least of them are males.
"""
Out[174]:
'\nThe number of single men are higher with compared to the number of females. The number of Monogamously married \nmen are also higher than the number of females. From the people in the category widow, least of them are males.\n'
In [175]:
#GN5 - Who decides what kind of financial services you can personally use

table = pd.crosstab(index=data_filtered["GN5"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

plt.xlabel('GN5', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.title('GN5 with respect to Gender')


"""
Most of the people among the people who decided for themselves are males. The decisons have been taken by the spouse for 
most of th females with comapred to the males. There are a considereable number of people who take their decisions
jointly
"""
Out[175]:
'\nMost of the people among the people who decided for themselves are males. The decisons have been taken by the spouse for \nmost of th females with comapred to the males. There are a considereable number of people who take their decisions\njointly\n'
In [176]:
#GN5 - Who decides what kind of financial services you can personally use

table = pd.crosstab(index=data_filtered["GN5"], 
                          columns=data_filtered["Gender"])
table.plot(kind="bar", 
                 figsize=(8,8),
                 stacked=True)

plt.xlabel('GN5', fontsize=12)
plt.ylabel('counts', fontsize=12)
plt.title('GN5 with respect to Gender')


"""
Most of the people among the people who decided for themselves are males. The decisons have been taken by the spouse for 
most of th females with comapred to the males. There are a considereable number of people who take their decisions
jointly
"""
Out[176]:
'\nMost of the people among the people who decided for themselves are males. The decisons have been taken by the spouse for \nmost of th females with comapred to the males. There are a considereable number of people who take their decisions\njointly\n'
In [177]:
#AA3 - Zone

ax = sns.catplot(x='AA3',kind='count',data=data_filtered,orient="h", hue = 'Gender')
ax.fig.autofmt_xdate()

Task 3: Correlation between diffent variables

Feature selection using Backward Elimination

In [179]:
y = data_filtered['Gender']
X = data_filtered.drop(['Gender'], axis = 1)
In [180]:
import statsmodels.api as sm

#Backward Elimination
cols = list(X.columns)
pmax = 1
while (len(cols)>0):
    p= []
    X_1 = X[cols]
    X_1 = sm.add_constant(X_1)
    model = sm.OLS(y,X_1).fit()
    p = pd.Series(model.pvalues,index = cols)      
    pmax = max(p)
    feature_with_p_max = p.idxmax()
    
    if(pmax>0.001):
        cols.remove(feature_with_p_max)
    else:
        break
selected_features_BE = cols
print(selected_features_BE)
/opt/anaconda3/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
['AA3', 'AA4', 'AA7', 'DG1', 'DG3', 'DG4', 'DG5_4', 'DG5_5', 'DG5_6', 'DL0', 'DL4_3', 'DL4_5', 'DL4_6', 'DL4_18', 'DL5', 'DL14', 'DL26_5', 'MT1', 'MT4_1', 'MT4_2', 'MT4_5', 'MT6', 'MT10', 'MT17_3', 'MT17_5', 'MT18_1', 'MT18_3', 'MT18_4', 'MT18_5', 'MT18_8', 'FF2', 'FF7_2', 'FF7_5', 'MM1', 'MMP1_8', 'IFI1_1', 'IFI3_3', 'IFI1_5', 'IFI17_2', 'FL1', 'FL4', 'FL6_2', 'FL13', 'FL14', 'FB16_6', 'FB26_10', 'FB27_1', 'LN2_4', 'GN1', 'GN2']
In [182]:
#50 another features were selected by backward elimination
len(selected_features_BE)
Out[182]:
50
In [183]:
#Features which are not in the data description (can be eliminated)
#AA7, AA4
In [184]:
#dg4 - education level
#DL0 - INCOME SOURCE
#
In [181]:
#DRop data without a description
df = data_filtered.drop(['AA7','AA4'], axis = 1)
In [185]:
import itertools
import scipy.stats as ss
In [186]:
cols = ['AA3', 'DG3', 'DG4', 'DG5_4', 'DG5_5', 'DG5_6', 'DL0', 'DL4_3', 'DL4_5', 'DL4_6', 'DL4_18', 'DL5', 'DL14', 'DL26_5', 'MT1', 'MT4_1', 'MT4_2', 'MT4_5', 'MT6', 'MT10', 'MT17_3', 'MT17_5', 'MT18_1', 'MT18_3', 'MT18_4', 'MT18_5', 'MT18_8', 'FF2', 'MM1', 'MMP1_8', 'IFI1_1', 'IFI3_3', 'IFI1_5', 'IFI17_2', 'FL1', 'FL4', 'FL6_2', 'FL13', 'FL14', 'FB16_6', 'FB26_10', 'FB27_1', 'LN2_4', 'GN1', 'GN2']

correlation between categorical features

In [187]:
def cramers_corrected_stat(confusion_matrix):
    """ calculate Cramers V statistic for categorical-categorical association.
        uses correction from Bergsma and Wicher, 
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))    
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))
In [188]:
cols = cols
corrM = np.zeros((len(cols),len(cols)))
# there's probably a nice pandas way to do this
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]
In [189]:
corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(25, 20))
ax = sns.heatmap(corr, annot=True, ax=ax); ax.set_title("Cramer V Correlation between Variables");

its clearly visible that some variables don't correlate with any other.

we can remove them to visualize other variables easily

In [190]:
cols = [ 'Gender','DG3', 'DG4', 'DG5_4', 'DG5_5', 'DG5_6', 'DL0', 'DL4_3', 'DL4_5', 'DL4_6', 'DL4_18', 'DL5', 'MT1', 'MT4_1', 'MT4_2', 'MT4_5', 'MT6', 'MT10', 'MT17_3', 'MT17_5', 'MT18_1', 'MT18_3', 'MT18_4',  'MT18_8', 'FF2', 'MM1',  'IFI3_3',  'IFI17_2', 'FL4', 'FL6_2', 'FL13', 'FL14', 'FB16_6', 'FB26_10',  'LN2_4', 'GN1', 'GN2']
In [191]:
cols = cols
corrM = np.zeros((len(cols),len(cols)))
# there's probably a nice pandas way to do this
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]
In [192]:
corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(28, 28))
ax = sns.heatmap(corr, annot=True, ax=ax); ax.set_title("Cramer V Correlation between Variables");

are as follows

-GN1 -GN2 -FL4 -MT6 -DL0

DL5 (source of income) and DL4_18, DL4_3, DL4_5, DL4_6 are highly correlated. SO, DL4_18, DL4_3, DL4_5, DL4_6 are categories of DL5. so we can remove those variable keeping the DL5 variable.

In [196]:
cols = [ 'Gender','DG3', 'DG4', 'DG5_4', 'DG5_5', 'DG5_6', 'DL0',  'DL5', 'MT1', 'MT4_1', 'MT4_2', 'MT4_5', 'MT6', 'MT10', 'MT17_3', 'MT17_5', 'MT18_1', 'MT18_3', 'MT18_4',  'MT18_8', 'FF2', 'MM1',  'IFI3_3',  'IFI17_2', 'FL4', 'FL6_2', 'FL13', 'FL14', 'FB16_6', 'FB26_10',  'LN2_4', 'GN1', 'GN2']
In [197]:
cols = cols
corrM = np.zeros((len(cols),len(cols)))
# there's probably a nice pandas way to do this
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]

    
corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(28, 28))
ax = sns.heatmap(corr, annot=True, ax=ax); ax.set_title("Cramer V Correlation between Variables");

browsing internet

Also browsing internet and having a touch screen is seems to be highly related. GN1 and GN2 are correlated. The person who decides on use of mney decides about the purchases made for the house hold also. Hence, GN1 describes the variation describes by GN2 as well. Hence. one variable can be removed.

In [198]:
#removing

removing_cols = ['DL4_18', 'DL4_3', 'DL4_5', 'DL4_6','GN2', 'MT4_5']
#also we can add the variables we identified earlier by feature selction to identify correlations
cols = [ 'Gender','DG3', 'DG4', 'DG5_4', 'DG5_5', 'DG5_6', 'DG6','DL1',
        'DL0',  'DL5', 'MT1', 'MT1A','MT4_1', 'MT4_2', 'MT6','MT6B','MT6A','MT6C','MT5', 'MT10', 
        'MT17_3', 'MT17_5', 'MT18_1', 'MT18_3', 'MT18_4',  'MT18_8', 
        'FF2', 'MM1',  'IFI3_3',  'IFI17_2', 'FL4', 'FL6_2', 'FL13', 'FL14', 
        'FB16_6', 'FB26_10',  'LN2_4', 'GN1', 'GN3', 'GN5']
In [199]:
cols = cols
corrM = np.zeros((len(cols),len(cols)))
# there's probably a nice pandas way to do this
for col1, col2 in itertools.combinations(cols, 2):
    idx1, idx2 = cols.index(col1), cols.index(col2)
    corrM[idx1, idx2] = cramers_corrected_stat(pd.crosstab(df[col1], df[col2]))
    corrM[idx2, idx1] = corrM[idx1, idx2]

    
corr = pd.DataFrame(corrM, index=cols, columns=cols)
fig, ax = plt.subplots(figsize=(28, 28))
ax = sns.heatmap(corr, annot=True, ax=ax); ax.set_title("Cramer V Correlation between Variables");

Highly correlated variables

DL0 and DG6 GN5 and GN6

Variables associated with gender

-GN1 -FL4 -MT6 -MT1A -DL0 -DL1 -DG6 -FL4 -GN5 -GN3

In [ ]: